 mutational process


MS-ConTab: Multi-Scale Contrastive Learning of Mutation Signatures for Pan Cancer Representation and Stratification

Dou, Yifan; Khadre, Adam; Petreaca, Ruben C.; Mirzaei, Golrokh

arXiv.org Artificial Intelligence

Motivation. Understanding the pan-cancer mutational landscape offers critical insights into the molecular mechanisms underlying tumorigenesis. While patient-level machine learning techniques have been widely employed to identify tumor subtypes, cohort-level clustering, where entire cancer types are grouped based on shared molecular features, has largely relied on classical statistical methods. Results. In this study, we introduce a novel unsupervised contrastive learning framework to cluster 43 cancer types based on coding mutation data derived from the COSMIC database. For each cancer type, we construct two complementary mutation signatures: a gene-level profile capturing nucleotide substitution patterns across the most frequently mutated genes, and a chromosome-level profile representing normalized substitution frequencies across chromosomes. These dual views are encoded using TabNet encoders and optimized via a multi-scale contrastive learning objective (NT-Xent loss) to learn unified cancer-type embeddings. We demonstrate that the resulting latent representations yield biologically meaningful clusters of cancer types, aligning with known mutational processes and tissue origins. Our work represents the first application of contrastive learning to cohort-level cancer clustering, offering a scalable and interpretable framework for mutation-driven cancer subtyping.
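The contrastive objective named in the abstract (NT-Xent) can be illustrated with a minimal NumPy sketch. It assumes each cancer type already has one gene-level and one chromosome-level embedding of equal dimension. The function name `nt_xent_loss`, the temperature of 0.5, and the random placeholder embeddings are assumptions for illustration only; the TabNet encoders and the paper's actual hyperparameters are not reproduced here.

```python
import numpy as np

def nt_xent_loss(z_gene, z_chrom, temperature=0.5):
    """NT-Xent loss over N paired embeddings (two views per cancer type).

    z_gene, z_chrom: arrays of shape (N, d); row i of each is the same cancer type.
    temperature: assumed value, not taken from the paper.
    """
    n = z_gene.shape[0]
    z = np.concatenate([z_gene, z_chrom], axis=0)       # (2N, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)    # unit vectors -> cosine similarity
    sim = (z @ z.T) / temperature                       # (2N, 2N) scaled similarities
    np.fill_diagonal(sim, -np.inf)                      # a sample is never its own negative

    # The positive for row i is the other view of the same cancer type.
    pos_idx = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])

    # Numerically stable log-sum-exp over each row (the softmax denominator).
    row_max = sim.max(axis=1, keepdims=True)
    log_denom = row_max[:, 0] + np.log(np.exp(sim - row_max).sum(axis=1))

    log_prob_pos = sim[np.arange(2 * n), pos_idx] - log_denom
    return -log_prob_pos.mean()

# Placeholder embeddings for 43 cancer types (random, for illustration only).
rng = np.random.default_rng(0)
gene_view = rng.normal(size=(43, 16))
chrom_view = gene_view + 0.1 * rng.normal(size=(43, 16))
loss = nt_xent_loss(gene_view, chrom_view)
```

Minimizing this loss pulls the two views of the same cancer type together and pushes apart views of different cancer types, which is what lets the two profiles be merged into one embedding per cancer type.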


Population sequencing data reveal a compendium of mutational processes in the human germ line

Science

It has become increasingly clear that mutation affects phenotypic variation and disease risk across humans. However, there are many different types of mutation. Seplyarskiy et al. applied a matrix factorization method to large human genomic datasets to identify germline mutational processes in an unsupervised manner. From this survey, nine robust mutational components were identified, and specific mechanisms generating seven of these processes were proposed from correlations with genomic features. These results confirm and improve upon our understanding of mutational processes and reveal likely mechanisms of mutation in the human genome. Science, aba7408, this issue p. 1030 (doi:10.1126/science.aba7408)

Biological mechanisms underlying human germline mutations remain largely unknown. We statistically decompose variation in the rate and spectra of mutations along the genome using volume-regularized nonnegative matrix factorization. The analysis of a sequencing dataset (TOPMed) reveals nine processes that explain the variation in mutation properties between loci. We provide a biological interpretation for seven of these processes. We associate one process with bulky DNA lesions that are resolved asymmetrically with respect to transcription and replication. Two processes track the direction of the replication fork and replication timing, respectively. We identify a mutagenic effect of active demethylation, primarily acting in regulatory regions, and a mutagenic effect of long interspersed nuclear elements. We localize a mutagenic process specific to oocytes from population sequencing data. This process appears transcriptionally asymmetric.
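The decomposition idea can be sketched with plain nonnegative matrix factorization. This is a simplified stand-in, not the authors' method: it uses the standard multiplicative updates for the Frobenius objective and omits the volume regularization described in the abstract. The function name `nmf`, the matrix sizes, and the synthetic input are all assumptions for illustration.

```python
import numpy as np

def nmf(V, k, n_iter=300, eps=1e-9, seed=0):
    """Plain NMF via multiplicative updates: V (loci x mutation types) ~ W @ H.

    W: (loci, k) per-locus intensity of each process.
    H: (k, mutation types) mutation spectrum of each process.
    No volume regularization is applied (unlike the method in the abstract).
    """
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        # Updates keep W and H nonnegative because every factor is nonnegative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def relative_error(V, W, H):
    """Frobenius reconstruction error relative to the norm of V."""
    return np.linalg.norm(V - W @ H) / np.linalg.norm(V)

# Synthetic example: 50 loci, 12 mutation types, generated from 3 hidden processes.
rng = np.random.default_rng(42)
V = rng.random((50, 3)) @ rng.random((3, 12))
W, H = nmf(V, k=3)
err = relative_error(V, W, H)
```

In this framing, each row of `H` plays the role of one mutational process and each row of `W` records how strongly that process acts at a locus. The study reports nine such components on TOPMed data, whereas the value of `k` here is arbitrary.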